Urban Greenspace and Depression Prevalence in Chicago¶

In this notebook, I will explore how greenspace relates to depression rates across Chicago at the census tract level. To perform this analysis, I will download census tract boundaries and depression rate data from the CDC PLACES dataset and compute NDVI statistics for each tract using NAIP imagery hosted by the Microsoft Planetary Computer. Because NAIP imagery is such high resolution, I will stream the imagery from Microsoft's servers through their STAC API and a VSI connection rather than downloading full scenes. The relationship between greenspace and depression rates will be assessed using linear regression.

Step 1: Import Packages¶

In [1]:
### Import libraries

# data directories/paths and organization
import os
import pathlib

# see warnings
import warnings

# packages to work with various file types and perform operations
# dataframes, arrays, GeoTIFFs, etc
import xarray as xr #arrays
import rioxarray as rxr #geographic arrays (i.e. TIFF files)
import rioxarray.merge as rxrmerge
import pandas as pd # dataframes
import numpy as np # mathematical operations on the data
import geopandas as gpd # geodataframes

# visualization packages
import geoviews as gv
import holoviews as hv
import hvplot.pandas
import hvplot.xarray
from cartopy import crs as ccrs
import shapely
import time

# data exploration and visualization
import matplotlib
import scipy.stats as stats
import matplotlib.pyplot as plt

# image processing
from scipy.ndimage import convolve
from scipy.ndimage import label

# cloud data
# This package lets us connect to the Microsoft Planetary Computer and perform operations on their servers
import pystac_client

# Import the Socrata class from the sodapy package
### If you haven't installed sodapy to your environment yet, you'll need to do so.
# There are instructions on performing this step later in the notebook
from sodapy import Socrata

# progress bar to monitor long downloads/processes
from tqdm.notebook import tqdm

# OLS regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
In [2]:
### Create reproducible file paths

# set a data directory path
data_dir = os.path.join(pathlib.Path.home(), 
                        
                        # these folders are applicable to my file structure; you may want to change them to fit yours
                        "Documents", 
                        "Graduate_School",
                        "EDA_Certificate", 
                        "Spring",
                        "data",
                        "01-urban-greenspace",
                        #'portfolio-post'
                        )

# Make the directory
os.makedirs(data_dir, exist_ok=True)
In [3]:
### Prevent GDAL from quitting due to momentary disruptions
os.environ["GDAL_HTTP_MAX_RETRY"] = "5"
os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"

Step 2: Download census tract data and create a site map¶

In this step, census tract data will be downloaded from the CDC.

In [4]:
### Set up the census tract path
tract_dir = os.path.join(data_dir, "chicago_tract")

# Make the directory
os.makedirs(tract_dir, exist_ok=True)

# Make a path to the tract shapefile
tract_path = os.path.join(tract_dir, "chicago_tract.shp")

### Download the census tracts (only once)
if not os.path.exists(tract_path):
    tract_url = ('https://data.cdc.gov/download/x7zy-2xmx/application%2Fzip')
    tract_gdf = gpd.read_file(tract_url)
    chicago_gdf = tract_gdf[tract_gdf.PlaceName == "Chicago"] # subset the tract gdf to Chicago
    chicago_gdf.to_file(tract_path, index=False) # save the file for later use
else:
    ### Load in census tract data
    chicago_gdf = gpd.read_file(tract_path)
    print("File has already been downloaded! It has been loaded into 'chicago_gdf'.")
File has already been downloaded! It has been loaded into 'chicago_gdf'.
In [5]:
# Check out the gdf
chicago_gdf
Out[5]:
place2010 tract2010 ST PlaceName plctract10 PlcTrPop10 geometry
0 1714000 17031010100 17 Chicago 1714000-17031010100 4854 POLYGON ((-9758835.381 5164429.383, -9758837.3...
1 1714000 17031010201 17 Chicago 1714000-17031010201 6450 POLYGON ((-9760143.496 5163888.741, -9760143.4...
2 1714000 17031010202 17 Chicago 1714000-17031010202 2818 POLYGON ((-9759754.212 5163883.196, -9759726.6...
3 1714000 17031010300 17 Chicago 1714000-17031010300 6236 POLYGON ((-9758695.229 5163870.91, -9758695.78...
4 1714000 17031010400 17 Chicago 1714000-17031010400 5042 POLYGON ((-9757724.634 5160715.939, -9757742.2...
... ... ... ... ... ... ... ...
804 1714000 17031835700 17 Chicago 1714000-17031835700 0 POLYGON ((-9754496.811 5132719.307, -9754491.8...
805 1714000 17031980000 17 Chicago 1714000-17031980000 0 POLYGON ((-9788665.342 5161808.277, -9788667.6...
806 1714000 17031980100 17 Chicago 1714000-17031980100 0 POLYGON ((-9766922.964 5128945.613, -9766938.5...
807 1714000 17043840000 17 Chicago 1714000-17043840000 0 POLYGON ((-9787210.28 5154713.902, -9787240.67...
808 1714000 17043840801 17 Chicago 1714000-17043840801 0 POLYGON ((-9787431.695 5154736.805, -9787441.0...

809 rows × 7 columns

In [6]:
chicago_gdf.tract2010.dtype
Out[6]:
dtype('O')
In [7]:
### Site plot -- Census tracts with satellite imagery in the background
chicago_tracts = (chicago_gdf
 # Set crs
 .to_crs(ccrs.Mercator())
 .hvplot(
    line_color = 'Red', fill_color = 'None',
    crs = ccrs.Mercator(), tiles = 'EsriImagery',
    frame_width=600,
    title='Census Tracts of Chicago',
    xlabel='Longitude',
    ylabel='Latitude'
    )
)

# Save figure
# hv.save(chicago_tracts, 'census_tracts_chicago.html')

# Show the figure
chicago_tracts
Out[7]:

Step 3: Download health data from CDC PLACES dataset¶

In this step, depression rate data at the census tract level will be downloaded from the CDC PLACES dataset. I will use the Socrata API to complete the download.

Step 3a: Download the dataset¶

In [8]:
# Install sodapy

# uncomment if sodapy needs to be installed
#!pip install sodapy
In [9]:
### Set up a path for the depression data
depr_dir = os.path.join(data_dir, "depression_data")
os.makedirs(depr_dir, exist_ok=True)
depr_path = os.path.join(depr_dir, "depression.csv")
In [10]:
# Only download the data once
if not os.path.exists(depr_path):

    # Use Socrata to download depression data
    client = Socrata("data.cdc.gov", None)

    # Filter for the parameters we need (Depression rates in 2023, in Cook County, IL)
    results = client.get("cwsq-ngmh", year='2023', stateabbr="IL", countyname="Cook", measureid="DEPRESSION", limit=1500)
    cdc_df = pd.DataFrame.from_records(results)

    # rename and select only important columns
    depr_df = (cdc_df
               .rename(columns={
                   'data_value': 'depression',
                   'low_confidence_limit': 'depression_ci_low',
                   'high_confidence_limit': 'depression_ci_high',
                   'locationname': 'tract'})
                
                # Select the columns to keep
                [[
                    'year', 'tract',
                    'depression', 'depression_ci_low', 'depression_ci_high',
                    'data_value_unit',
                    'totalpopulation', 'totalpop18plus']]
    )
    # download to a csv
    depr_df.to_csv(depr_path)
else:
    depr_df = pd.read_csv(depr_path)
    print("The file has already been downloaded! It has been loaded into 'depr_df'.")

# Check that the download worked
depr_df
The file has already been downloaded! It has been loaded into 'depr_df'.
Out[10]:
Unnamed: 0 year tract depression depression_ci_low depression_ci_high data_value_unit totalpopulation totalpop18plus
0 0 2023 17031080300 22.2 19.7 24.7 % 5499 5071
1 1 2023 17031030703 19.1 16.8 21.4 % 3075 2829
2 2 2023 17031160400 20.1 17.8 22.4 % 4854 3956
3 3 2023 17031081401 18.6 16.4 20.9 % 2508 2394
4 4 2023 17031062900 22.3 19.8 25.0 % 4125 3635
... ... ... ... ... ... ... ... ... ...
1323 1323 2023 17031835000 18.7 16.4 20.9 % 6398 4514
1324 1324 2023 17031839700 17.9 15.6 20.0 % 4545 3732
1325 1325 2023 17031836800 22.0 19.4 24.5 % 2645 1983
1326 1326 2023 17031842000 17.7 15.5 19.8 % 2590 2456
1327 1327 2023 17031834500 21.5 19.0 24.0 % 1765 1216

1328 rows × 9 columns

In [11]:
# check the depression dtype
print(depr_df.depression.dtype)

# Convert to a float for later analyses
depr_df.depression = depr_df.depression.astype('float')

# Check that
print(depr_df.depression.dtype)
float64
float64

Step 3b: Merge the dataset w/ the tract gdf¶

In [12]:
### Align tract identifier datatype for merging
chicago_gdf.tract2010 = chicago_gdf.tract2010.astype('int64')
depr_df.tract = depr_df.tract.astype('int64')

### Merge census data with geometry
tract_depr_gdf = (
    chicago_gdf
    .merge(depr_df,
           
           # specify which column to merge on as the column names are different
           left_on = 'tract2010',
           right_on = 'tract',

           # inner join
           how = 'inner'
           )
)

# check to make sure the merge worked
tract_depr_gdf
Out[12]:
place2010 tract2010 ST PlaceName plctract10 PlcTrPop10 geometry Unnamed: 0 year tract depression depression_ci_low depression_ci_high data_value_unit totalpopulation totalpop18plus
0 1714000 17031010100 17 Chicago 1714000-17031010100 4854 POLYGON ((-9758835.381 5164429.383, -9758837.3... 98 2023 17031010100 20.3 18.0 22.7 % 4905 4083
1 1714000 17031010201 17 Chicago 1714000-17031010201 6450 POLYGON ((-9760143.496 5163888.741, -9760143.4... 26 2023 17031010201 20.6 18.2 23.0 % 6939 5333
2 1714000 17031010202 17 Chicago 1714000-17031010202 2818 POLYGON ((-9759754.212 5163883.196, -9759726.6... 34 2023 17031010202 19.2 16.9 21.4 % 2742 2296
3 1714000 17031010300 17 Chicago 1714000-17031010300 6236 POLYGON ((-9758695.229 5163870.91, -9758695.78... 109 2023 17031010300 19.2 17.0 21.6 % 6305 5606
4 1714000 17031010400 17 Chicago 1714000-17031010400 5042 POLYGON ((-9757724.634 5160715.939, -9757742.2... 72 2023 17031010400 23.5 20.8 26.2 % 5079 4742
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
783 1714000 17031770700 17 Chicago 1714000-17031770700 0 POLYGON ((-9780753.304 5157066.079, -9780752.0... 832 2023 17031770700 20.3 17.9 22.7 % 2537 2093
784 1714000 17031770800 17 Chicago 1714000-17031770800 0 POLYGON ((-9783235.84 5154620.343, -9783211.23... 762 2023 17031770800 20.4 18.0 22.9 % 5661 4457
785 1714000 17031805600 17 Chicago 1714000-17031805600 0 POLYGON ((-9776210.02 5161605.738, -9776213.47... 964 2023 17031805600 19.5 17.2 21.6 % 4710 3420
786 1714000 17031807900 17 Chicago 1714000-17031807900 0 POLYGON ((-9768609.902 5160576.634, -9768654.5... 732 2023 17031807900 16.7 14.7 18.8 % 4201 3396
787 1714000 17031808100 17 Chicago 1714000-17031808100 0 MULTIPOLYGON (((-9774480.671 5161127.722, -977... 707 2023 17031808100 17.2 15.1 19.5 % 4010 3648

788 rows × 16 columns

In [13]:
### Plot depression data as a choropleth

# a choropleth is a color-coded map
chicago_depr_chloro = (
    # Load basemap w/ satellite imagery
    gv.tile_sources.EsriImagery
    
    *
    
    # map census tracts w/ depression imagery
    gv.Polygons(
        # reproject to ensure CRSs are aligned
        tract_depr_gdf.to_crs(ccrs.Mercator()),

        # select columns
        vdims = ['depression', 'tract2010'],

        # double check to ensure CRSs align
        crs = ccrs.Mercator()
    ).opts(color = 'depression', colorbar = True, 
           tools = ['hover'], clabel = "Depression Rate")
).opts(
    width = 600, height = 600, 
    xaxis = None, yaxis = None,
    title = "Depression Density per Census Tract in Chicago")

# Save figure
# hv.save(chicago_depr_chloro, 'chicago_depression_chloropleth.html')

# Show figure
chicago_depr_chloro
Out[13]:

The choropleth of depression rates per census tract across Chicago shows the greatest concentration (darker shading) in the northeast section of the city. There are other notable clusters of high rates, generally in the southwest corner, with a few in the central section; elsewhere, rates are generally low across the central and southern sections of the city.

Step 4: Get URLs for NAIP imagery¶

In this step, I will prepare to process the NDVI statistics by first obtaining the URL(s) for the NAIP imagery that corresponds to each census tract. This list of URLs will then be used to compute the NDVI statistics.

In [14]:
### Connect to the planetary computer catalog

e84_catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)
In [15]:
### Convert geometry to lat/lon for STAC
tract_latlon_gdf = chicago_gdf.to_crs(4326)

### Define a path to save NDVI stats
ndvi_dir = os.path.join(data_dir, "ndvi_data")
os.makedirs(ndvi_dir, exist_ok=True)
naip_image_url_path = os.path.join(ndvi_dir, "chicago-naip-imagery-urls.csv")

# initialize empty list for census tract IDs
downloaded_tracts = []
# Make a list to collect per-tract NAIP scene URL dataframes
scene_dfs = []

### Check for existing data - do not access duplicate tracts

# conditional to only download URLs once
if os.path.exists(naip_image_url_path):
    scene_df = pd.read_csv(naip_image_url_path)
    print("Census tract imagery URLs have already been downloaded. URLs have been stored in 'scene_df'.")

    # Fill list w/ tract values
    downloaded_tracts = scene_df.tract.values
else:
    print("No census tracts downloaded yet.")

    ### Loop through each census tract
    
    # Loop through each tract value in the gdf
    for _, tract_values in tqdm(tract_latlon_gdf.iterrows()):

        # Extract the tract ID (iterrows already yields the row,
        # so no extra .iloc lookup is needed)
        tract = tract_values.tract2010

        ### Check if URLs are already downloaded for this tract
        if tract not in downloaded_tracts:

            ### Retry up to 5 times in case of a momentary disruption
            # (use a separate counter so we don't shadow the loop variables)
            attempt = 0
            retry_limit = 5
            while attempt < retry_limit:

                ### Try accessing the STAC
                try:

                    # Search for NAIP tiles
                    naip_search = e84_catalog.search(
                        # Identify product we're interested in
                        collections = ['naip'],
                        # imagery that intersects w/ our tract
                        intersects = shapely.to_geojson(tract_values.geometry),
                        # from 2021
                        datetime = '2021'
                    )

                    ### Build dataframe with tracts and tile urls
                    scene_dfs.append(pd.DataFrame(dict(
                        
                        # tract column
                        tract = tract,
                        # date column
                        date = [pd.to_datetime(scene.datetime).date()
                                # return the scene
                                for scene in naip_search.items()],
                        # URL of the 4-band (RGB + NIR) image asset
                        rgbir_href = [scene.assets['image'].href
                                      for scene in naip_search.items()]
                        ))
                    )

                    # exit retry loop once STAC request is successful
                    break

                ### Try again in case of an APIError
                except pystac_client.exceptions.APIError:
                    print(
                        f'Could not connect with STAC server. '
                        f'Retrying tract {tract}...')
                    
                    # wait 2 seconds between tries
                    time.sleep(2)
                    attempt += 1
                    continue


    ### Concatenate the url dataframes
    if scene_dfs:

        # concatenate
        scene_df = pd.concat(scene_dfs).reset_index(drop=True)

        # Save tract imagery urls
        scene_df.to_csv(naip_image_url_path, index=False)

        # Confirm that imagery URLs have been downloaded and saved
        print("Census tract imagery URLs have been downloaded.")
    else:
        scene_df = None
        print('URL downloads were unsuccessful.')

# Preview the url dataframe
scene_df
Census tract imagery URLs have already been downloaded. URLs have been stored in 'scene_df'.

Step 5: Calculate NDVI statistics and download results¶

In this step, I will compute NDVI statistics on the Microsoft Planetary Computer's servers and then download the results into the notebook environment. The statistics calculated will be: fraction vegetated, the fraction of the census tract with vegetated (green) landcover; edge density, a measure of how fragmented or cohesive the greenspace in a tract is; and mean patch size, the average size of a contiguous greenspace patch in the tract. To identify edges for the edge density, I will perform a convolution with a 3x3 kernel.
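Before running the statistics on full NAIP scenes, they can be illustrated on a tiny toy mask. This is a minimal sketch with a made-up 5x5 array; the kernel and the `label`/`bincount` steps mirror the approach used in the cells below.

```python
import numpy as np
from scipy.ndimage import convolve, label

# Toy "vegetated" mask with two separate 2x2 patches (illustrative only)
veg_mask = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=bool)

# Label connected patches (default 4-connectivity)
labeled_patches, num_patches = label(veg_mask)  # num_patches == 2

# Pixel count per patch, skipping the background count at index 0
patch_sizes = np.bincount(labeled_patches.ravel())[1:]
mean_patch_size = patch_sizes.mean()  # 4.0 pixels per patch

# Edge detection: a vegetated pixel whose 8 neighbors are all vegetated
# sums to zero, so any nonzero result marks an edge between veg and non-veg
kernel = np.array([
    [1, 1, 1],
    [1, -8, 1],
    [1, 1, 1]])
edges = convolve(veg_mask.astype(int), kernel, mode='constant')
edge_density = np.sum(edges != 0) / veg_mask.size
```

In this toy case every vegetated pixel is an edge pixel, since no pixel is fully surrounded by vegetation; on real tracts, larger cohesive patches have interior pixels and therefore lower edge density.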

In [16]:
# Make CSV for NDVI statistics
ndvi_path_veg = os.path.join(ndvi_dir, 'chicago_ndvi_stats_veg.csv')
In [17]:
### Compute NDVI statistics for each tract (only once)

# big conditional to ensure we only download data once and the whole notebook can be run
if not os.path.exists(ndvi_path_veg):

    # Message to clearly indicate status
    print("NDVI statistics have not been downloaded yet. Downloading now...")

    ### Skip this step if the URL download failed
    if scene_df is not None:

        ### Loop through the census tracts with URLs
        for tract, tract_date_gdf in tqdm(scene_df.groupby('tract')):

            ### Open all images for tract
            # create a list to store NDVI array; 1 array per NAIP image
            tile_das = []

            # I used Gemini to help identify a tract ID type mismatch here

            # iterate over rows in tract-specific DF
            for _, href_s in tract_date_gdf.iterrows():

                ### Open vsi connection to data
                tile_da = rxr.open_rasterio(
                    
                    # mask and squeeze
                    href_s.rgbir_href, masked=True).squeeze()
                
                ### Clip data (1. get boundary 2. crop to bounding box 3. clip to exact geoms)
                # get tract boundary
                boundary = (
                    chicago_gdf
                    # deal with integer issues
                    .assign(tract2010 = lambda df: df['tract2010'].astype(str))
                    # use tractID as the index
                    .set_index('tract2010')
                    # get the tract in question
                    .loc[[str(tract)]]
                    # make sure CRSs match
                    .to_crs(tile_da.rio.crs)
                    # and take the geometry of that boundary
                    .geometry
                )

                # crop to a bounding box
                crop_da = tile_da.rio.clip_box(
                    # get coordinates of bounding box
                    *boundary.envelope.total_bounds,
                    # ensure any pixels right on the boundary are included
                    auto_expand = True
                )

                # clip to actual tract geometry (again, this two-step process just makes it faster)
                clip_da = crop_da.rio.clip(
                    boundary, 
                    # any pixels on boundary are included
                    all_touched=True)
                    
                # Compute NDVI for each image within the boundary
                # NDVI = (NIR-R)/(NIR+R)
                ndvi_da = (
                    (clip_da.sel(band = 4) - clip_da.sel(band = 1))
                    / (clip_da.sel(band = 4) + clip_da.sel(band = 1))
                )

                ### Accumulate result
                tile_das.append(ndvi_da)

            ### Merge data
            scene_da = rxrmerge.merge_arrays(tile_das)

            ### Mask vegetation
            veg_mask = (scene_da > 0.3) # we've picked NDVI values that are >0.3 for pixels to be counted as "vegetated"

            ### Calculate statistics and save data to file
            # count all the valid pixels and all the vegetated pixels in the tract 
            total_pixels = scene_da.notnull().sum()
            veg_pixels = veg_mask.sum()

            ### Identify vegetation patches by labeling each patch of vegetation
            labeled_patches, num_patches = label(veg_mask)

            # Count pixels in each patch, then calculate the mean patch size for each tract
            patch_sizes = np.bincount(labeled_patches.ravel())[1:]
            mean_patch_size = patch_sizes.mean()

            ### Calculate edge density
            # define kernel
            kernel = np.array([
                [1, 1, 1],
                [1, -8, 1],
                [1, 1, 1]])
            
            # detect boundaries between veg and non-veg
            edges = convolve(veg_mask, kernel, mode = 'constant')

            # calculate proportion of edge pixels by tract area
            edge_density = np.sum(edges != 0) / veg_mask.size
            
            ### Now, save all of these statistics we've calculated:
            ### Add a row to the statistics file that captures all of the tracts
            # create new DF for the tract and append it to the CSV
            
            pd.DataFrame(dict(           
                # Store tract ID
                tract = [tract],
                # store total pixel count as an integer
                total_pixels = [int(total_pixels)],
                # store fractions of vegetated pixels (have to use float b/c not whole number)
                frac_veg = [float(veg_pixels / total_pixels)],
                # store mean patch size
                mean_patch_size = [mean_patch_size],
                # store edge density
                edge_density = [edge_density]

            # write stats DF to the CSV we've established
            )).to_csv(
                ndvi_path_veg,
                # Append the data
                mode = 'a',
                # get rid of row numbers
                index = False,
                # Write column names but only if they haven't been written yet
                header = (not os.path.exists(ndvi_path_veg))
            )

    ### Re-load results from file
    ndvi_stats_df = pd.read_csv(ndvi_path_veg)
    print("The NDVI Statistics have been downloaded! They have been loaded into 'ndvi_stats_df'")

else:
    ndvi_stats_df = pd.read_csv(ndvi_path_veg)
    print("The NDVI Statistics have already been downloaded! They have been loaded into 'ndvi_stats_df'")
The NDVI Statistics have already been downloaded! They have been loaded into 'ndvi_stats_df'
In [18]:
# Check out the NDVI statistics to make sure everything worked
ndvi_stats_df
Out[18]:
tract total_pixels frac_veg mean_patch_size edge_density
0 17031010100 1059033 0.178657 55.225919 0.118612
1 17031010201 1531554 0.213859 57.543394 0.161668
2 17031010202 978546 0.186055 63.260250 0.123673
3 17031010300 1308392 0.191675 57.113642 0.126384
4 17031010400 1516964 0.198563 52.983817 0.079474
... ... ... ... ... ...
804 17031843900 4521964 0.199113 23.985215 0.124282
805 17031980000 87088150 0.006512 4.424265 0.017816
806 17031980100 8386263 0.007853 6.411604 0.015556
807 17043840000 17789223 0.012904 6.498868 0.028728
808 17043840801 463264 0.006148 2.859438 0.019183

809 rows × 5 columns

Step 6: Explore Data¶

In this step, I will create side-by-side choropleth plots to visually compare depression rates and the various greenspace statistics.

In [19]:
### Merge census data with geometry
chi_ndvi_depr_gdf = (

    # Grab census tract gdf
    tract_depr_gdf

    # merge w/ NDVI df
    .merge(
        ndvi_stats_df,

        # match on the tract ID
        # left: tract2010
        # right: tract
        left_on = 'tract2010',
        right_on = 'tract',
        how = 'inner'
    )
)

# check it out
chi_ndvi_depr_gdf
Out[19]:
place2010 tract2010 ST PlaceName plctract10 PlcTrPop10 geometry Unnamed: 0 year tract_x ... depression_ci_low depression_ci_high data_value_unit totalpopulation totalpop18plus tract_y total_pixels frac_veg mean_patch_size edge_density
0 1714000 17031010100 17 Chicago 1714000-17031010100 4854 POLYGON ((-9758835.381 5164429.383, -9758837.3... 98 2023 17031010100 ... 18.0 22.7 % 4905 4083 17031010100 1059033 0.178657 55.225919 0.118612
1 1714000 17031010201 17 Chicago 1714000-17031010201 6450 POLYGON ((-9760143.496 5163888.741, -9760143.4... 26 2023 17031010201 ... 18.2 23.0 % 6939 5333 17031010201 1531554 0.213859 57.543394 0.161668
2 1714000 17031010202 17 Chicago 1714000-17031010202 2818 POLYGON ((-9759754.212 5163883.196, -9759726.6... 34 2023 17031010202 ... 16.9 21.4 % 2742 2296 17031010202 978546 0.186055 63.260250 0.123673
3 1714000 17031010300 17 Chicago 1714000-17031010300 6236 POLYGON ((-9758695.229 5163870.91, -9758695.78... 109 2023 17031010300 ... 17.0 21.6 % 6305 5606 17031010300 1308392 0.191675 57.113642 0.126384
4 1714000 17031010400 17 Chicago 1714000-17031010400 5042 POLYGON ((-9757724.634 5160715.939, -9757742.2... 72 2023 17031010400 ... 20.8 26.2 % 5079 4742 17031010400 1516964 0.198563 52.983817 0.079474
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
783 1714000 17031770700 17 Chicago 1714000-17031770700 0 POLYGON ((-9780753.304 5157066.079, -9780752.0... 832 2023 17031770700 ... 17.9 22.7 % 2537 2093 17031770700 5547730 0.022761 7.584359 0.020706
784 1714000 17031770800 17 Chicago 1714000-17031770800 0 POLYGON ((-9783235.84 5154620.343, -9783211.23... 762 2023 17031770800 ... 18.0 22.9 % 5661 4457 17031770800 174812 0.138652 6.197392 0.172915
785 1714000 17031805600 17 Chicago 1714000-17031805600 0 POLYGON ((-9776210.02 5161605.738, -9776213.47... 964 2023 17031805600 ... 17.2 21.6 % 4710 3420 17031805600 463 0.000000 NaN 0.000000
786 1714000 17031807900 17 Chicago 1714000-17031807900 0 POLYGON ((-9768609.902 5160576.634, -9768654.5... 732 2023 17031807900 ... 14.7 18.8 % 4201 3396 17031807900 26883 0.079121 23.898876 0.101617
787 1714000 17031808100 17 Chicago 1714000-17031808100 0 MULTIPOLYGON (((-9774480.671 5161127.722, -977... 707 2023 17031808100 ... 15.1 19.5 % 4010 3648 17031808100 44540 0.303839 19.785088 0.154949

788 rows × 21 columns

In [20]:
# Define a choropleth-plotting function to be used later

def plot_chloropleth(gdf, **opts):

    # docstring to describe the function
    """Generate a choropleth with the given color column"""

    # use geoviews to make a polygon object
    return gv.Polygons(

        # use Mercator
        gdf.to_crs(ccrs.Mercator()),

        # set plot crs to Mercator
        crs = ccrs.Mercator()
    
    # drop the x and y axes (unnecessary) and add a colorbar
    ).opts(xaxis = None, yaxis = None, colorbar = True, **opts)
In [21]:
# Apply function to chicago data; plot edge density
depr_edge_chloro = (
    plot_chloropleth(

        # plot depression by census tract
        chi_ndvi_depr_gdf, color = 'depression', cmap = 'plasma', clabel = 'Depression Rate')
        .opts(width = 400, title = 'Depression Rate per Census Tract')
    
    # add a second plot w/ vegetation edge density
    + 
    plot_chloropleth(
        # plot vegetation edge density by census tract
        chi_ndvi_depr_gdf, color = 'edge_density', cmap = 'Greens', clabel = 'Edge Density')
        .opts(width = 400, title = 'Edge Density per Census Tract')
)

# Save plot
# hv.save(depr_edge_chloro, 'depression_edge_chloropleth.html')

# Show plot
depr_edge_chloro
Out[21]:
In [22]:
# Plot Mean Patch Size
depr_patch_chloro = (
    plot_chloropleth(

        # plot depression by census tract
        chi_ndvi_depr_gdf, color = 'depression', cmap = 'plasma', clabel = 'Depression Rate')
        .opts(width = 400, title = 'Depression Rate per Census Tract')
    
    # add a second plot w/ mean patch size
    + 
    plot_chloropleth(
        # plot mean patch size by census tract
        chi_ndvi_depr_gdf, color = 'mean_patch_size', cmap = 'Greens', clabel = 'Mean Patch Size')
        .opts(width = 400, title = 'Mean Patch Size per Census Tract')
)

# Save plot
# hv.save(depr_patch_chloro, 'depression_mean-patch_chloropleth.html')

# Show plot
depr_patch_chloro
Out[22]:
In [ ]:
# Fraction of Vegetation
depr_frac_veg_chloro = (
    plot_chloropleth(

        # plot depression by census tract
        chi_ndvi_depr_gdf, color = 'depression', cmap = 'plasma', clabel = 'Depression Rate')
        .opts(width = 400, title = 'Depression Rate per Census Tract')
    
    # add a second plot w/ vegetated fraction
    + 
    plot_chloropleth(
        # plot vegetated fraction by census tract
        chi_ndvi_depr_gdf, color = 'frac_veg', cmap = 'Greens', clabel = 'Vegetated Fraction')
        .opts(width = 400, title = 'Fraction of Census Tract that is Vegetated')
)

# Save plot
#hv.save(depr_frac_veg_chloro, 'depression_frac-veg_chloropleth.html')

# Show plot
depr_frac_veg_chloro
Out[ ]:

Visual comparisons of the depression rate and the vegetation statistics per census tract seem to show a positive relationship between greenspace and depression. Across all three variables, depression rates generally appear higher in places with higher greenspace values. However, this relationship does not hold throughout the city: some tracts in the northwest and southwest corners have high greenspace values but comparatively low depression rates.

Step 7: Quantify the relationship between greenspace and depression using linear regression¶

In this step, I will use a linear regression model to quantify the strength of the relationship between greenspace and depression rates. I will evaluate the model with a train-test split and investigate spatial bias. Before performing the regression, I will check the data for collinearity and non-normal distributions, since either can undermine the model; transformations of the data may be necessary.
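As a preview of that workflow, here is a minimal sketch of a train-test-split regression on synthetic stand-in data; the predictor and response below are made up for illustration and are not the notebook's actual columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 1))            # stand-in predictor (e.g. a greenspace metric)
y = 25 - 10 * X[:, 0] + rng.normal(0, 1, 200)   # stand-in response with known slope and noise

# Hold out a test set so the fit is evaluated on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit on the training split, score (R^2) on the held-out split
model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)
```

Because the held-out tracts play no role in fitting, the test-set R² gives a more honest estimate of how well the model generalizes than the training fit would.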

Step 7a: Data Preparation¶

In [24]:
# investigate NaN values per column
chi_ndvi_depr_gdf.isna().any()
Out[24]:
place2010             False
tract2010             False
ST                    False
PlaceName             False
plctract10            False
PlcTrPop10            False
geometry              False
Unnamed: 0            False
year                  False
tract_x               False
depression            False
depression_ci_low     False
depression_ci_high    False
data_value_unit       False
totalpopulation       False
totalpop18plus        False
tract_y               False
total_pixels          False
frac_veg              False
mean_patch_size        True
edge_density          False
dtype: bool
In [25]:
# NaN values per row
chi_ndvi_depr_gdf.isna().any(axis = 1)
Out[25]:
0      False
1      False
2      False
3      False
4      False
       ...  
783    False
784    False
785     True
786    False
787    False
Length: 788, dtype: bool
In [26]:
# Check the prevalence of NaNs
chi_ndvi_depr_gdf.isna().sum()

# this demonstrates we'll want to just drop that one row
Out[26]:
place2010             0
tract2010             0
ST                    0
PlaceName             0
plctract10            0
PlcTrPop10            0
geometry              0
Unnamed: 0            0
year                  0
tract_x               0
depression            0
depression_ci_low     0
depression_ci_high    0
data_value_unit       0
totalpopulation       0
totalpop18plus        0
tract_y               0
total_pixels          0
frac_veg              0
mean_patch_size       1
edge_density          0
dtype: int64
In [27]:
# Show rows with any missing values
chi_ndvi_depr_gdf[chi_ndvi_depr_gdf.isnull().any(axis=1)]
Out[27]:
place2010 tract2010 ST PlaceName plctract10 PlcTrPop10 geometry Unnamed: 0 year tract_x ... depression_ci_low depression_ci_high data_value_unit totalpopulation totalpop18plus tract_y total_pixels frac_veg mean_patch_size edge_density
785 1714000 17031805600 17 Chicago 1714000-17031805600 0 POLYGON ((-9776210.02 5161605.738, -9776213.47... 964 2023 17031805600 ... 17.2 21.6 % 4710 3420 17031805600 463 0.0 NaN 0.0

1 rows × 21 columns

In [28]:
# drop NaNs
chi_ndvi_depr_gdf = chi_ndvi_depr_gdf.dropna()

# confirm no NaNs remain
chi_ndvi_depr_gdf.isna().sum()
Out[28]:
place2010             0
tract2010             0
ST                    0
PlaceName             0
plctract10            0
PlcTrPop10            0
geometry              0
Unnamed: 0            0
year                  0
tract_x               0
depression            0
depression_ci_low     0
depression_ci_high    0
data_value_unit       0
totalpopulation       0
totalpop18plus        0
tract_y               0
total_pixels          0
frac_veg              0
mean_patch_size       0
edge_density          0
dtype: int64
In [29]:
### Select the variables we'll use for the linear model

variables = ['frac_veg', 'depression', 'mean_patch_size', 'edge_density']
In [30]:
### Visualize distribution of data

# make a scatter plot
scatter = hvplot.scatter_matrix(
    chi_ndvi_depr_gdf[variables], title = "NDVI Statistics Scatterplot"
)

# save scatterplot
# hv.save(scatter, 'depression_greenspace_scatterplot.html')

# show the plot
scatter
Out[30]:

Scatterplots show varying degrees of collinearity among the variables. Fraction vegetated and edge density are the most collinear, so we may want to leave one of them out. Mean patch size and fraction vegetated also show some structure, but not to the same degree, so they should not pose an issue.
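To put a number on the collinearity visible in the scatter matrix, we could compute pairwise Pearson correlations. This is a sketch on synthetic data; in the notebook, `chi_ndvi_depr_gdf[variables]` would be passed instead of the hypothetical `demo` frame:

```python
import numpy as np
import pandas as pd

def correlation_report(df: pd.DataFrame) -> pd.Series:
    """Return pairwise |Pearson r| for every column pair, sorted descending."""
    corr = df.corr(numeric_only=True).abs()
    # keep only the upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    return corr.where(mask).stack().sort_values(ascending=False)

# synthetic example: b is nearly a linear function of a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({'a': a,
                     'b': 2 * a + rng.normal(scale=0.1, size=200),
                     'c': rng.normal(size=200)})
print(correlation_report(demo))  # the ('a', 'b') pair tops the list with |r| near 1
```

Pairs with |r| close to 1 are the ones worth dropping one member of before fitting the regression.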

In [31]:
# make histograms of the data

# make facet
fig, axes = plt.subplots(2, 2, figsize = (12,10))

# we'll use the same variables to plot; set more friendly titles
variables = ['frac_veg', 'depression', 'mean_patch_size', 'edge_density']
titles = ['Vegetated Fraction', 'Adult Depression Rate', 'Mean Patch Size', 'Edge Density']

# Loop through variables and axes to plot histograms
for i, (var, title) in enumerate(zip(variables, titles)):
    ax = axes[i//2, i%2]
    ax.hist(chi_ndvi_depr_gdf[var], bins = 10, 
            color = 'gray', edgecolor = 'black')
    ax.set_title(title)
    ax.set_xlabel(var)
    ax.set_ylabel('Frequency')

# Ensure figures don't overlap
plt.tight_layout()

# Save figure
# plt.savefig('greenspace_histogram.png')

# Show the figure
plt.show()
[Figure: 2×2 histograms of vegetated fraction, adult depression rate, mean patch size, and edge density]
In [32]:
# q-q plots

# set variables
var_qq = ['frac_veg', 'edge_density','mean_patch_size', 'depression']

# make q-q plots
plt.figure(figsize = (12, 10))
for i, var in enumerate(var_qq, 1):

    # make 2x2 facet
    plt.subplot(2, 2, i)

    # normal distribution
    stats.probplot(chi_ndvi_depr_gdf[var], dist = 'norm', plot = plt)

    # title
    plt.title(f'Q-Q Plot for {var}')

# Set layout
plt.tight_layout()

# Save plot
# plt.savefig('qq_plot.png')

# Show the plot
plt.show()
[Figure: q-q plots of frac_veg, edge_density, mean_patch_size, and depression against a normal distribution]

The histograms above show that the patch size variable has a long right tail and that the depression variable is bimodal. The other two variables (fraction vegetated and edge density) appear closer to normally distributed. This suggests we should log transform the depression and patch size variables.

However, the q-q plots indicate that depression might be normally distributed, and that mean patch size and fraction vegetated are not. So, we'll log transform all of the variables and compare to decide which versions to include in the analysis.
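A formal check can complement the visual q-q assessment. The sketch below applies the Shapiro-Wilk test (`scipy.stats.shapiro`) to a synthetic long-tailed sample before and after a log transform; the lognormal data here only mimics the shape of the patch size variable, not the notebook's actual values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# lognormal data mimics the long right tail seen in mean patch size
skewed = rng.lognormal(mean=3.0, sigma=0.5, size=300)

# Shapiro-Wilk: a low p-value means we reject normality
_, p_raw = stats.shapiro(skewed)
_, p_log = stats.shapiro(np.log(skewed))

print(f"raw p = {p_raw:.2e}, log p = {p_log:.2f}")
```

For genuinely lognormal data, the raw sample fails the test while the log-transformed sample does not, which is the same logic used to choose transformations here.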

In [33]:
# Create a model DataFrame
model_df = (
    chi_ndvi_depr_gdf

    # copy that
    .copy()

    # subset the data
    [['frac_veg', 'depression', 'mean_patch_size', 'edge_density', 'geometry']]

    # # drop rows w/o observations
    # .dropna()
)

model_df
Out[33]:
frac_veg depression mean_patch_size edge_density geometry
0 0.178657 20.3 55.225919 0.118612 POLYGON ((-9758835.381 5164429.383, -9758837.3...
1 0.213859 20.6 57.543394 0.161668 POLYGON ((-9760143.496 5163888.741, -9760143.4...
2 0.186055 19.2 63.260250 0.123673 POLYGON ((-9759754.212 5163883.196, -9759726.6...
3 0.191675 19.2 57.113642 0.126384 POLYGON ((-9758695.229 5163870.91, -9758695.78...
4 0.198563 23.5 52.983817 0.079474 POLYGON ((-9757724.634 5160715.939, -9757742.2...
... ... ... ... ... ...
782 0.020053 19.7 13.674242 0.029123 MULTIPOLYGON (((-9787333.178 5161561.245, -978...
783 0.022761 20.3 7.584359 0.020706 POLYGON ((-9780753.304 5157066.079, -9780752.0...
784 0.138652 20.4 6.197392 0.172915 POLYGON ((-9783235.84 5154620.343, -9783211.23...
786 0.079121 16.7 23.898876 0.101617 POLYGON ((-9768609.902 5160576.634, -9768654.5...
787 0.303839 17.2 19.785088 0.154949 MULTIPOLYGON (((-9774480.671 5161127.722, -977...

787 rows × 5 columns

In [34]:
### Perform variable transformation
# The histograms suggest log transforming the depression and patch size variables.
# The q-q plots, however, suggest depression may already be normal while mean patch
# size and fraction vegetated are not, so we log transform all four and compare.

# fraction vegetated log
model_df['log_frac_veg'] = np.log(model_df.frac_veg)

# depression log
model_df['log_depression'] = np.log(model_df.depression)

# mean patch size log
model_df['log_patch'] = np.log(model_df.mean_patch_size)

# edge density log
model_df['log_edge'] = np.log(model_df.edge_density)

# double check that those were calculated
model_df
Out[34]:
frac_veg depression mean_patch_size edge_density geometry log_frac_veg log_depression log_patch log_edge
0 0.178657 20.3 55.225919 0.118612 POLYGON ((-9758835.381 5164429.383, -9758837.3... -1.722286 3.010621 4.011432 -2.131894
1 0.213859 20.6 57.543394 0.161668 POLYGON ((-9760143.496 5163888.741, -9760143.4... -1.542437 3.025291 4.052539 -1.822208
2 0.186055 19.2 63.260250 0.123673 POLYGON ((-9759754.212 5163883.196, -9759726.6... -1.681715 2.954910 4.147257 -2.090110
3 0.191675 19.2 57.113642 0.126384 POLYGON ((-9758695.229 5163870.91, -9758695.78... -1.651954 2.954910 4.045043 -2.068434
4 0.198563 23.5 52.983817 0.079474 POLYGON ((-9757724.634 5160715.939, -9757742.2... -1.616649 3.157000 3.969987 -2.532326
... ... ... ... ... ... ... ... ... ...
782 0.020053 19.7 13.674242 0.029123 MULTIPOLYGON (((-9787333.178 5161561.245, -978... -3.909357 2.980619 2.615514 -3.536243
783 0.022761 20.3 7.584359 0.020706 POLYGON ((-9780753.304 5157066.079, -9780752.0... -3.782706 3.010621 2.026088 -3.877331
784 0.138652 20.4 6.197392 0.172915 POLYGON ((-9783235.84 5154620.343, -9783211.23... -1.975789 3.015535 1.824129 -1.754953
786 0.079121 16.7 23.898876 0.101617 POLYGON ((-9768609.902 5160576.634, -9768654.5... -2.536782 2.815409 3.173831 -2.286542
787 0.303839 17.2 19.785088 0.154949 MULTIPOLYGON (((-9774480.671 5161127.722, -977... -1.191257 2.844909 2.984929 -1.864662

787 rows × 9 columns

In [35]:
### Visualize transformed variables

var_log = ['log_frac_veg', 'log_edge', 'log_depression', 'log_patch']
hvplot.scatter_matrix(
    model_df[var_log], title = "NDVI Statistics Scatterplot after Log Transform"
)
Out[35]:
In [36]:
# make histograms of the log transformation

# make facet
fig, axes = plt.subplots(2, 2, figsize = (12,10))

# plot the log-transformed variables; set more friendly titles
variables = ['log_frac_veg', 'log_depression', 'log_patch', 'log_edge']
titles = ['Log Vegetated Fraction', 'Log Adult Depression Rate', 'Log Mean Patch Size', 'Log Edge Density']

# Loop through variables and axes to plot histograms
for i, (var, title) in enumerate(zip(variables, titles)):
    ax = axes[i//2, i%2]
    ax.hist(model_df[var], bins = 10, 
            color = 'gray', edgecolor = 'black')
    ax.set_title(title)
    ax.set_xlabel(var)
    ax.set_ylabel('Frequency')

# Ensure figures don't overlap
plt.tight_layout()

# Save figure
# plt.savefig('greenspace_log_histogram.png')

# Show the figure
plt.show()
[Figure: 2×2 histograms of the log-transformed variables]
In [37]:
# q-q plots

# set variables
var_qq = ['log_frac_veg', 'log_edge','log_patch', 'log_depression']

# make q-q plots
plt.figure(figsize = (12, 10))
for i, var in enumerate(var_qq, 1):

    # make 2x2 facet
    plt.subplot(2, 2, i)

    # normal distribution
    stats.probplot(model_df[var], dist = 'norm', plot = plt)

    # title
    plt.title(f'Q-Q Plot for {var}')

# set layout
plt.tight_layout()

# save plot
# plt.savefig('log_qq_plot.png')

# Show the plot
plt.show()
[Figure: q-q plots of the log-transformed variables against a normal distribution]

The histograms and q-q plots show that the log transform improved most variables: all except edge density are now closer to normally distributed. Based on this, I will use the untransformed edge density variable together with the logs of mean patch size, fraction vegetated, and depression for the regression modeling.

Step 7b: Linear Regression¶

In the final section, I will fit a linear regression to the dataset to better understand the relationship between the greenspace statistics and depression rates across Chicago. I will use a train-test split (67% / 33%), setting a fixed random seed to ensure the work is reproducible. Because fraction vegetated and edge density were quite collinear, I will only use one of them in the linear regression; in this case, edge density.

In [38]:
# scikit-learn for the train/test split and linear regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

### Select predictor and outcome variables
x = model_df[['edge_density', 'log_patch']] # if desired, edge_density could be swapped for log_frac_veg.
y = model_df[['log_depression']]

### Split into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(
    
    # split up the data, reserving 33% of the data for testing and using 67% for training
    # basically, we'll keep 33% of the data back to test the accuracy of the linear regression
    x, y, test_size=0.33, 
    
    # set the state of the random split so that the same split occurs every time the code is run.
    # this is to ensure reproducibility
    random_state=19
)

### Fit a linear regression
reg = LinearRegression()

# fit to our data
reg.fit(x_train, y_train)

### Predict depression values for the test dataset
# we have to use np.exp to undo the log transformation we did earlier
y_test['pred_depr'] = np.exp(reg.predict(x_test))

# Real depression rate to compare
y_test['depression'] = np.exp(y_test.log_depression)

# Check out the comparison
y_test
Out[38]:
log_depression pred_depr depression
295 2.965273 19.908374 19.4
549 3.000720 18.960117 20.1
595 2.965273 18.796073 19.4
378 2.917771 19.079415 18.5
16 2.917771 19.214983 18.5
... ... ... ...
548 2.906901 19.226897 18.3
260 3.044522 18.577043 21.0
504 2.862201 19.359082 17.5
529 2.850707 18.561064 17.3
59 3.091042 19.258162 22.0

260 rows × 3 columns

In [39]:
### Plot measured vs. predicted depression prevalence with a 1-to-1 line
y_max = y_test.depression.max()

meas_vs_pred_plot = (
    (y_test
    .hvplot.scatter(x='depression', y='pred_depr',
                    xlabel = 'Measured Adult Depression Prevalence',
                    ylabel = 'Predicted Adult Depression Prevalence',
                    title = 'Linear Regression Performance - Test Data')
    .opts(aspect='equal', 
          xlim=(0, y_max), ylim=(0, y_max), 
          width=600, height=600)
) * hv.Slope(slope=1, y_intercept=0).opts(color='black')
)

# Save the plot
# hv.save(meas_vs_pred_plot, 'measured_vs_predicted_plot.html')

# Show the plot
meas_vs_pred_plot
Out[39]:

The plot above shows that the greenspace statistics don't have a strong relationship to the prevalence of depression across the city. If the relationship were strong, the blue points would fall much closer to the 1-to-1 line.
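To quantify how weak the fit is, we could compute R² and RMSE on the held-out data with `sklearn.metrics`. The sketch below uses toy arrays shaped like the flat band of predictions above; in the notebook, `y_test.depression` and `y_test.pred_depr` would be the inputs:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def regression_report(measured, predicted):
    """Return (r2, rmse) for measured vs. predicted values."""
    r2 = r2_score(measured, predicted)
    rmse = np.sqrt(mean_squared_error(measured, predicted))
    return r2, rmse

# toy example: predictions clustered near the mean of the measurements,
# like the flat band of points in the plot above
measured = np.array([17.5, 18.3, 19.4, 20.1, 21.0, 22.0])
predicted = np.array([19.4, 19.2, 18.8, 19.0, 18.6, 19.3])

r2, rmse = regression_report(measured, predicted)
print(f"R^2 = {r2:.2f}, RMSE = {rmse:.2f}")
```

A predictor that hugs the mean regardless of the input produces an R² near (or below) zero, which is the numerical signature of the scatter seen in the plot.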

Step 7c: Spatial Bias¶

In this section, I will explore how the linear regression model's error varies across the city.

In [40]:
### Compute model error for all census tracts

# Use the trained model to predict depression prevalence for each census tract
model_df['pred_depr'] = np.exp(reg.predict(x))

# Calculate model error
model_df['error_depr'] = model_df['pred_depr'] - model_df['depression']

### Plot error geographically as a choropleth
error_chloropleth = (
    plot_chloropleth(model_df, color = 'error_depr', cmap = 'RdBu', clabel = 'Error')
    
    # set frame width
    .opts(frame_width = 600, aspect = 'equal', 
          title = 'Linear Regression Error Rate across Census Tracts')
)

# Export plot
# hv.save(error_chloropleth, 'error_chloropleth.html')

# Show the plot
error_chloropleth
Out[40]:

The choropleth of model error across Chicago reveals a distinct pattern. In general, the model underestimates the depression rate across the northern half of the city while overestimating it across the southern half, though there are clear pockets of underestimation in the south as well.

The model error spans a range of at least 10 percentage points, with the largest-magnitude errors being overestimates.
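The error range claimed above can be read directly off the error column. A minimal sketch, with toy values standing in for `model_df['error_depr']`:

```python
import pandas as pd

def error_summary(errors: pd.Series) -> dict:
    """Summarize signed model error: extremes, range, and share of overestimates."""
    return {
        'min': errors.min(),              # most extreme underestimate (negative)
        'max': errors.max(),              # most extreme overestimate (positive)
        'range': errors.max() - errors.min(),
        'pct_over': (errors > 0).mean(),  # fraction of tracts overestimated
    }

# toy errors mimicking the pattern described above:
# overestimates larger in magnitude than underestimates
toy = pd.Series([-3.1, -2.2, -0.5, 1.4, 4.8, 6.9])
print(error_summary(toy))
```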

This plot reinforces that the greenspace statistics are not strongly related to depression rates in Chicago.
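Beyond eyeballing the map, spatial clustering of model error could be tested formally with Moran's I. The notebook does not import a spatial statistics package (such as `esda`), so this sketch implements the statistic directly on a toy one-dimensional row of tracts:

```python
import numpy as np

def morans_i(values: np.ndarray, weights: np.ndarray) -> float:
    """Moran's I spatial autocorrelation under a symmetric binary weights matrix."""
    x = values - values.mean()
    n = len(values)
    num = n * (weights * np.outer(x, x)).sum()  # cross-products of neighboring deviations
    den = weights.sum() * (x ** 2).sum()
    return num / den

# toy 1-D "map": 8 tracts in a row; adjacent tracts are neighbors
n = 8
w = np.zeros((n, n))
for i in range(n - 1):
    w[i, i + 1] = w[i + 1, i] = 1

# negative errors (underestimates) clustered at one end, positive at the other,
# mimicking the north/south pattern in the map above
clustered = np.array([-2.0, -2.0, -1.5, -1.0, 1.0, 1.5, 2.0, 2.0])
print(morans_i(clustered, w))  # strongly positive: errors are spatially clustered
```

A value near +1 indicates neighboring tracts share similar errors (spatial clustering), near 0 indicates no spatial pattern, and negative values indicate a checkerboard pattern. Applying this to the tract errors would formalize the north/south bias visible in the map.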